Actionable absolute risk prediction of atherosclerotic cardiovascular disease based on the UK Biobank

Cardiovascular diseases (CVDs) are the leading cause of death globally. Timely and accurate identification of people at risk of developing an atherosclerotic CVD and its sequelae is a central pillar of preventive cardiology. One widely used approach is risk prediction models; however, currently available models consider only a limited set of risk factors and outcomes, yield no actionable advice to individuals based on their holistic medical state and lifestyle, are often not interpretable, and were built with small cohort sizes or are based on lifestyle data from the 1960s (e.g. the Framingham model). The risk of developing atherosclerotic CVDs is heavily lifestyle dependent, potentially making many occurrences preventable. Providing actionable and accurate risk prediction tools to the public could assist in atherosclerotic CVD prevention. Accordingly, we developed a benchmarking pipeline to find the best combination of data preprocessing and algorithms to predict absolute 10-year atherosclerotic CVD risk. Based on the data of 464,547 UK Biobank participants without atherosclerotic CVD at baseline, we used a comprehensive set of 203 consolidated risk factors associated with atherosclerosis and its sequelae (e.g. heart failure). Our two best performing absolute atherosclerotic risk prediction models achieved higher performance (AUROC: 0.7573, 95% CI: 0.755–0.7595 and AUROC: 0.7544, 95% CI: 0.7522–0.7567) than Framingham (AUROC: 0.680, 95% CI: 0.6775–0.6824) and QRisk3 (AUROC: 0.725, 95% CI: 0.7226–0.7273). Using a subset of 25 risk factors identified with feature selection, our reduced model achieves similar performance (AUROC: 0.7415, 95% CI: 0.7392–0.7438) while being less complex. Further, it is interpretable, actionable and highly generalizable. The model could be incorporated into clinical practice and might allow continuous personalized predictions with automated intervention suggestions.


Introduction
Globally, cardiovascular diseases (CVDs) are the leading cause of death [1,2]. In 2016, 17.9 million people died of CVDs alone, accounting for 31% of all global deaths [1].
[…] under accession/application number 34802. Interested parties can apply for the data from UK Biobank directly, at http://www.ukbiobank.ac.uk. The UK Biobank will consider data applications from bona fide researchers for health-related research that is in the public interest. This process for reader access to the UK Biobank is the same process as followed by the authors of this study and will provide the same data. All extracted columns are stated in the supporting information files S1 Table and S2 Table.
[…] version disregarding all cholesterol risk factors as well as systolic blood pressure, in order to provide a simple approach for risk prediction in remote settings with limited testing resources [32]. However, survival models such as the proportional hazard model are not designed to provide absolute risk estimates for individual patients. Machine learning (ML) based approaches have many advantages compared to humans or standard statistical algorithms, such as superior performance, the ability to identify complex non-linear patterns and to encode diverse and high-dimensional data types, greater robustness to outliers, support for continuous model updates, versatility across domains and scalability [33–36].
However, classic disadvantages of ML based approaches are their lack of interpretability, the risk of inherent bias due to the data used, and the difficulty of explaining to physicians why a new risk model might be superior to existing ones, all of which hinder widespread adoption of ML based risk prediction models [36,37]. One example of ML based CVD risk prediction is the AutoPrognosis based approach, where an ensemble of multiple ML pipelines has also been applied to the UK Biobank dataset for 5-year CVD risk prediction [29]. Further, a purely ML-driven approach can lead to a model that requires too many risk factors to compute risk, which is infeasible for routine clinical check-ups. Another disadvantage of purely data-driven approaches is the inclusion of risk factors which might show strong correlations but are unrelated to the pathophysiology of CVDs or are not actionable, making them inapplicable in a clinical setting or as an actionable self-management tool [29].
The aim of this study was to use a large-data ML approach to develop an actionable absolute risk prediction tool which considers the holistic health of an individual. Uniquely, we focused on behavioral risk factors relating to all atherosclerotic CVD outcomes. Our approach is novel in that we employ a holistic understanding of an individual's current health status to better quantify their risk of all atherosclerotic CVDs and to provide actionable advice. By utilizing a comprehensive set of lifestyle factors, we enable the subsequent suggestion of personalized and actionable advice relating to unhealthy risk factors. Instead of using only a limited set of risk factors, we aimed to achieve this by taking multiple biological layers into account, which include: (i) multi-omics data from blood samples (e.g. lipidome and proteome); (ii) family history (e.g. genome); (iii) lifestyle data; (iv) clinical data; and (v) environmental data; along with (vi) an extensive set of risk factors and outcomes.
We used data from 464,547 participants of the UK Biobank study who did not have atherosclerotic CVD at their baseline visit. We created an automated pipeline to benchmark risk prediction classifier algorithms against each other, then evaluated their predictive performances in the overall population and tested the generalizability of the top-performing classifiers through retraining and testing on different sub-populations. We explored the clinical implications of the proposed classifiers, with a focus on the top-performing models. This study does not focus on the algorithmic aspects of the utilized classifiers.
Methodological details on the utilized classifiers can be found in the open-source documentation of the respective algorithms of the scikit-learn [38] and xgboost [39] libraries and in the supporting information (S4 Table).

Materials and methods
Baseline data from the UK Biobank was utilized to extract an extensive set of risk factors and outcomes associated with the pathophysiology of atherosclerotic CVDs. A benchmarking pipeline was used to train and evaluate different standard and ML algorithms for the task of 10-year atherosclerotic CVD risk prediction. The performance was measured using AUROC and compared against the baseline models Framingham and QRisk3, which are widely used and recommended models. We further evaluated our best performing models by analyzing the most informative features, assessing model generalizability and creating a reduced model.

Study design and participants
The UK Biobank is a long-term prospective large-scale biomedical database including over 500,000 participants who were aged 40–69 years when recruited between 2006 and 2010. The database is globally accessible to approved researchers undertaking research into the most common and life-threatening diseases and continuously collects phenotypic and genotypic data about its participants, including data from questionnaires, physical measures, blood, urine and saliva samples, and lifestyle data [40]. This data is further linked to each participant's health-related records, accelerometry, multimodal imaging, genome-wide genotyping and longitudinal follow-up data for a wide range of health-related outcomes [40,41]. The UK Biobank study protocol is available online [42].
The North West Multi-Centre Research Ethics Committee approved the UK Biobank study and all participants provided written informed consent prior to study enrollment. Our research is covered by the UK Biobank's Generic Research Tissue Bank (RTB) Approval and was approved by the UK Biobank Access Management Team [43].
We excluded participants with atherosclerotic CVDs present before or during baseline, participants who chose to leave the UKB study and participants who were lost to follow-up for various reasons. The resulting cohort consisted of 464,547 participants. The last available date of participant follow-up was March 5th, 2020.
Risk factor definition. We curated a list of all generally known risk factors and outcomes for atherosclerotic CVDs from the medical literature and from validated risk prediction models. This preliminary list of risk factors was reduced through curation to focus on those factors that were clearly involved in the pathophysiology of atherosclerosis and those that are modifiable through behavioral change. The curation was carried out by three medical doctors with experience in diagnosing or scientifically modelling cardiovascular diseases. We consolidated all relevant UKB columns into 203 risk factors and grouped them into six categories: demographics (e.g. age, biological sex, ethnicity), biomarkers (e.g. cholesterol, glucose, blood pressure, heart rate), lifestyle (e.g. alcohol consumption, smoking, physical activity, sleep, social visits), environment (e.g. exposure to tobacco smoke, work and housing and other socio-economic related factors), genetics (e.g. family history of CVD, stroke, diabetes, high cholesterol, high blood pressure) and comorbidities (e.g. heart arrhythmias, diabetes, acute & chronic kidney injury, migraines, rheumatoid arthritis, systemic lupus erythematosus, severe mental illnesses (schizophrenia, bipolar disorder, depression, psychosis), diagnosis or treatment of erectile dysfunction, atypical antipsychotic medication). A categorized list of all risk factors used in our analysis is provided in the supplementary data (S1 Table).
Outcome definition. In the same manner as described above, an initial list of atherosclerotic CVDs was further reviewed and curated by the same team of medical doctors. All resulting CVDs of interest are associated with atherosclerotic plaque build-up, are modifiable and relate to the collected risk factors only. Thus, we disregard brain haemorrhages due to accidents and congenital and pregnancy-related CVDs, which are not actionable. The curated list of all ICD-10 and ICD-9 outcomes meeting the above criteria consists of 193 total (125 unique) CVD outcomes, e.g. coronary/ischaemic heart disease, heart attack, angina, stroke, cardiac arrest, congestive heart failure, left ventricular failure, myocardial infarction, aortic valve stenosis, cerebral artery occlusions, nontraumatic haemorrhages. A list with all outcome codes used in our analysis is provided in the supplementary data (S2 Table). An atherosclerotic CVD event was defined as the first occurrence, during the 10-year follow-up period, of any of the atherosclerotic CVD outcome diagnosis codes, including as a primary or secondary cause of death.
Cohort follow-up. Follow-up time was set to 10 years, as commonly used in other risk models (see Table 2 in [7]), and counted from the date of the initial assessment center visit. Individuals who died from other causes during their follow-up period or had a relevant CVD event after their individual follow-up period were marked as not having had a relevant CVD event.
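The labelling rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `add_years` and `label_participant`, the treatment of the window end as inclusive, and the handling of 29 February are all assumptions made for illustration.

```python
from datetime import date
from typing import Optional

FOLLOW_UP_YEARS = 10  # follow-up length stated in the text


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:  # 29 Feb in a non-leap target year
        return d.replace(year=d.year + years, day=28)


def label_participant(baseline: date, first_cvd_event: Optional[date]) -> int:
    """Return 1 if a relevant CVD event falls inside the follow-up window, else 0.

    `first_cvd_event` is the earliest relevant diagnosis or CVD death date,
    or None when no such record exists (e.g. death from another cause).
    Participants with events at or before baseline are assumed to have been
    excluded from the cohort beforehand.
    """
    if first_cvd_event is None:
        return 0
    window_end = add_years(baseline, FOLLOW_UP_YEARS)
    # Assumption: the last day of the window still counts as inside it.
    return int(baseline < first_cvd_event <= window_end)
```

For example, a participant assessed on 1 March 2008 with a first event on 1 June 2013 would be labelled 1, whereas a first event on 1 January 2019 lies outside the window and would be labelled 0.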

Models used in comparison
Framingham risk score. The Framingham 10-year CVD absolute risk score is based on data from two prospective studies, the Framingham Heart Study and the Framingham Offspring Study [27]. The cohort consists of 8491 participants (4522 women and 3969 men) who attended a baseline examination between 30 and 74 years of age and were free of CVD. A positive CVD outcome was defined as any of the following: coronary death, myocardial infarction, coronary insufficiency, angina, ischemic stroke, hemorrhagic stroke, transient ischemic attack, peripheral artery disease and heart failure.
Participants were followed up for 12 years, during which 1174 participants developed a CVD. Two biological sex-specific risk models were derived, one using lipid measurements and the other using Body Mass Index (BMI). The variables used were biological sex, age, total cholesterol, HDL cholesterol, treated and untreated systolic blood pressure, smoking status and diabetes status.
The Framingham risk calculators and model coefficients are publicly available [44]. We imputed missing data using simple mean imputation.
QRisk3. The QRisk3 10-year CVD absolute risk score is based on a prospective open cohort study using data from general practices (GPs), mortality and hospital records in England [28]. The cohort consists of 10.56 million patients between the ages of 25 and 84 years, of whom 75% were used for training and 25% for validation. Patients with a preexisting CVD, a missing Townsend score or statin use at baseline were excluded. Patients were classified as having a positive CVD outcome when any of the following outcomes was present during follow-up in the GP, hospital or mortality records: coronary heart disease, ischaemic stroke, or transient ischaemic attack. QRisk3 used the following ICD-10 codes: G45 (transient ischaemic attack and related syndromes), I20 (angina pectoris), I21 (acute myocardial infarction), I22 (subsequent myocardial infarction), I23 (complications after myocardial infarction), I24 (other acute ischaemic heart disease), I25 (chronic ischaemic heart disease), I63 (cerebral infarction), and I64 (stroke not specified as haemorrhage or infarction). The utilized ICD-9 codes were: 410, 411, 412, 413, 414, 434, and 436. Participants were followed up for 15 years, during which 363,565 participants of the training set (4.6%) developed a relevant CVD. One biological sex-specific risk model was derived.
The risk factors used in the final model were age, ethnicity, deprivation, systolic blood pressure, BMI, total cholesterol/HDL cholesterol ratio, smoking status, family history of coronary heart disease, diabetes status, treated hypertension, rheumatoid arthritis, atrial fibrillation, chronic kidney disease, systolic blood pressure variability, diagnosis of migraine, corticosteroid use, systemic lupus erythematosus, atypical antipsychotic use, diagnosis of severe mental illnesses, diagnosis or treatment of erectile dysfunction.
The QRisk3 risk calculator and model coefficients are publicly available [45], built into all major NHS GP systems and included in the UK's national guidelines (https://www.healthcheck.nhs.uk/seecmsfile/?id=1687, accessed 10th November 2021). We imputed missing data using simple mean imputation.

Standard linear and ML models. Since the introduction of the classic CVD risk prediction methods, the field of supervised machine learning has developed out of classical statistics, with the primary purpose of maximizing predictive accuracy using modern statistical methods. Therefore, in addition to using standard linear models, we tested the major ML approaches, covering a wide spectrum of the possible ML design space, to evaluate which model type performs best for our task. Based on our initial benchmarking pipeline results, we focused on reporting the results of the initially best performing models: logistic regression, random forest and XGBoost.
We compared regularized logistic regression (with L1 penalty), random forests and gradient boosting (xgboost implementation) to assess the highest achievable Area Under the Receiver Operating Characteristic Curve (AUROC) value. We used this value as a reference when assessing the trade-off between number of features and predictive performance of several simpler practical risk predictors, as determined by an iterative feature elimination procedure outlined below. L1 regularization for logistic regression imposes a strong penalty on non-zero feature weights, resulting in a feature selection procedure that discards features that are likely to be non-predictive. Random Forest is an ensemble method that fits many decision trees independently, each to a subset of the data. We implemented both methods using their scikit-learn library implementation. Finally, we evaluated Extreme Gradient Boosting. Gradient boosting is an ensemble tree-based machine learning method that combines many weak classifiers to produce a stronger one. It sequentially fits a series of classification or regression trees, with each tree created to predict the outcomes misclassified by the previous tree [46]. By sequentially predicting residuals of previous trees, the gradient boosting process focuses on predicting more difficult cases and correcting its own shortcomings. Extreme Gradient Boosting (XGB / XGBoost) is a specific implementation of the gradient boosting process, and uses memory-efficient algorithms to improve computational speed and model performance [39,47].
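To make the idea of sequentially fitting residuals concrete, the following is a toy sketch of gradient boosting with one-split regression trees (stumps) under squared-error loss on a single feature. It is a didactic illustration only, not the XGBoost algorithm used in the study, which additionally uses regularization and second-order gradient information. The names `fit_stump` and `boost`, the learning rate and the number of rounds are assumptions, and the input is assumed to contain at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds between values
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean outcome
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each new stump corrects part of what earlier stumps got wrong.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)
```

With `model = boost([1, 2, 3, 4], [0, 0, 1, 1])`, the prediction for `x = 1` approaches 0 and the prediction for `x = 4` approaches 1 as the residuals shrink round by round.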

Model development and benchmarking using pipeline
We built a benchmarking pipeline for automated and reproducible data extraction, normalization, imputation, model training, tuning of model hyperparameters, classification, documentation and reporting.
We implemented all models using their respective scikit-learn library or xgboost library implementation using the Python programming language [38,39]. Details on the used Python libraries, methods and parameters are provided in the supplementary data (S3 and S4 Tables).
Categorical values were one-hot encoded. Data normalization was performed by removing the mean and scaling to unit variance. Data imputation was performed for all models using a simple mean imputation. The models' hyper-parameters were determined using grid search and stratified k-fold cross validation using 3 folds was employed to avoid overfitting.
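The preprocessing steps above can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the pipeline's actual code (which relies on scikit-learn): the helper names `fit_column`, `transform_column` and `one_hot` are hypothetical, missing values are represented as `None`, and the mean and standard deviation are assumed to be estimated from the observed values only, since the text does not state the order of imputation and scaling.

```python
from statistics import fmean, pstdev


def fit_column(values):
    """Learn mean and population standard deviation from observed values."""
    observed = [v for v in values if v is not None]
    mean = fmean(observed)
    std = pstdev(observed) or 1.0  # avoid division by zero for constants
    return mean, std


def transform_column(values, mean, std):
    """Mean-impute missing entries, then scale to zero mean, unit variance."""
    return [((mean if v is None else v) - mean) / std for v in values]


def one_hot(values, categories):
    """Encode each categorical value as a 0/1 indicator vector."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Statistics are fitted once (e.g. on training data) and then reused.
mean, std = fit_column([1.0, None, 3.0])
scaled = transform_column([1.0, None, 3.0], mean, std)
encoded = one_hot(["never", "current"], ["never", "previous", "current"])
```

One consequence of combining mean imputation with standardization in this way is that every imputed entry maps to exactly 0 after scaling, so a missing value contributes nothing to a linear model's log odds.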
Finally, we assessed model performance mainly using the AUROC. Fig 1 visualizes an overview of all performed steps of our experimental setup.
Iterative feature elimination. We employed an iterative feature elimination procedure based on the regularized logistic regression for finding the best trade-off between predictive performance and number of risk factors, with the aim of creating a risk prediction algorithm that is applicable in the clinical context. We used the standard L1 regularization (also known as Lasso) proposed by [57]; it implements a strong penalty on non-zero feature weights of our logistic regression model, resulting in a sparse feature set for prediction.
A logistic regression coefficient value β can be interpreted as the expected change in the log odds of having the outcome per unit change in the corresponding feature x. Therefore, increasing the feature by one unit multiplies the odds of having the outcome by e^β. This means that we can interpret the coefficients as feature importance values in the sense that the feature with the smallest absolute coefficient has the least influence on model predictions. Importantly, this only holds in the context of the parameters contained in the current model. Thus, we re-estimate the model after each feature elimination round.
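The odds interpretation follows directly from the form of the logistic regression model. As a short worked derivation, with $p$ denoting the predicted event probability, $\beta_0$ the intercept and $\beta_j$ the coefficient of feature $x_j$ (the numerical value $\beta_j = 0.1$ is a made-up example, not a coefficient from the study):

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_{j} \beta_j x_j
\qquad\Longrightarrow\qquad
\frac{\operatorname{odds}(x_j + 1)}{\operatorname{odds}(x_j)}
  = \frac{e^{\beta_0 + \beta_j (x_j + 1) + \sum_{k \neq j} \beta_k x_k}}
         {e^{\beta_0 + \beta_j x_j + \sum_{k \neq j} \beta_k x_k}}
  = e^{\beta_j},
\qquad
\text{e.g. } \beta_j = 0.1 \;\Rightarrow\; e^{0.1} \approx 1.105 .
```

Because the features were scaled to unit variance, a "one unit" change here corresponds to one standard deviation of the original feature, which is what makes the absolute coefficient values comparable across features.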
In each iteration, we re-estimated the logistic regression model on the remaining parameters, and then discarded all parameters that were set to zero by the L1 regularization; finally, we also discarded the parameter with the lowest non-zero absolute value.
As an additional step, we created a ranking of the relative feature importance value of each feature by dividing its absolute coefficient weight by the sum of all absolute coefficient weights.
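The elimination loop and the relative importance ranking can be sketched as follows. This is a minimal illustration, not the authors' implementation: `fit` stands in for any routine that re-estimates an L1-regularized logistic regression on the given features and returns one coefficient per feature, and the names `relative_importance` and `iterative_elimination` are hypothetical.

```python
from typing import Callable, Dict, List

Coefs = Dict[str, float]


def relative_importance(coefs: Coefs) -> Coefs:
    """Absolute coefficient divided by the sum of all absolute coefficients."""
    total = sum(abs(w) for w in coefs.values())
    return {name: abs(w) / total for name, w in coefs.items()}


def iterative_elimination(
    features: List[str], fit: Callable[[List[str]], Coefs]
) -> List[List[str]]:
    """Return the non-zero feature set of every round, largest first."""
    history = []
    remaining = list(features)
    while remaining:
        coefs = fit(remaining)  # re-estimate on the remaining features
        # Discard everything the L1 penalty has set to zero.
        kept = [f for f in remaining if coefs[f] != 0.0]
        if not kept:
            break
        history.append(kept)
        # Also discard the feature with the smallest non-zero |coefficient|.
        weakest = min(kept, key=lambda f: abs(coefs[f]))
        remaining = [f for f in kept if f != weakest]
    return history


# Stub standing in for a real model fit; the weights are made up.
FAKE_WEIGHTS = {"age": 0.9, "sex": -0.5, "noise": 0.0, "bmi": 0.2}


def fake_fit(feats: List[str]) -> Coefs:
    return {f: FAKE_WEIGHTS[f] for f in feats}


rounds = iterative_elimination(["age", "sex", "noise", "bmi"], fake_fit)
ranking = relative_importance({"age": 0.9, "sex": -0.5, "bmi": 0.2})
```

In practice, each recorded feature set would be paired with its cross-validated AUROC, so that the trade-off between the number of risk factors and predictive performance can be read off directly.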
Statistical analysis. To reduce overfitting, we evaluated the classification performance of all our benchmarked algorithms by using 3-fold stratified cross-validation and measuring the Area Under the Receiver Operating Characteristic Curve. For the cross-validation, we used a training set with 325,182 participants to train and derive our standard linear and ML models and then assessed the AUROC performance on the held-out test set with 139,365 participants, using all 203 risk factors in both cases. We reported the AUROC and the 95% confidence intervals (Wilson score intervals) for all models and performed a sensitivity analysis using Shapley Additive Explanations (SHAP values) for the best performing linear model.
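As an illustration of the two reported quantities, the sketch below computes an AUROC from labels and scores and a Wilson score interval for a proportion. It is a minimal sketch, not the study's code: the function names are hypothetical, the AUROC uses a simple pairwise comparison that is only practical for small inputs, and treating the AUROC as a proportion with n equal to the number of test participants is an assumption, although it closely reproduces the intervals reported in the abstract.

```python
from math import sqrt


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for a proportion p observed on n samples."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly
# Assumption: n is the size of the held-out test set.
low, high = wilson_interval(0.7573, 139_365)
```

With the XGB AUROC of 0.7573 and the 139,365 test participants, this yields approximately 0.7550 to 0.7595, in line with the reported 95% CI.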
Generalizability. With 442,620 out of the 502,551 participants in the UK Biobank, the cohort has a high proportion (88.1%) of participants with British White ethnicity. In an effort to estimate a proxy for out-of-sample generalizability, we re-trained the two best models, XGB and logistic regression with L1 regularization, only on Whites and tested their performance on

Characteristics of the training and test populations
Of 502,551 participants in the UK Biobank, we excluded 7.6%: those who had already experienced a relevant CVD outcome (during or before baseline) and those who were lost or withdrew from the biobank. This resulted in 464,547 participants who met the inclusion criteria. Of those participants, 28,561 (6.1%) developed at least one of the relevant CVD outcomes during their 10-year follow-up period. We used a common split of 70% of the data as a training set and 30% as a hold-out test set. Table 1 shows the overlap of our atherosclerotic CVD outcome definition with the CVD outcome definition used in the related work approach by Alaa et al. [29].

Prediction accuracy
The resulting prediction accuracy of the benchmarked models is shown in Table 2. We used both Framingham 10-year CVD risk versions, with and without lipids, as well as QRisk3 as baseline models to assess the performance of predicting someone's 10-year risk of developing an atherosclerotic cardiovascular disease based on a holistic set of risk factors, with a focus on actionable risk factors and outcomes. The best performing model was XGB with an AUROC of 75.73%, only marginally higher than the logistic regression model with L1 regularization (75.44%) and substantially better than the Random Forest model (66.90%). Fig 2 shows the AUROCs of the two best performing models, XGB and logistic regression with L1 regularization; the latter is also the simplest model tested. Logistic regression comes with the advantages of being interpretable by providing reasoning for its classifications, and being a simple and robust method [36].
Table 1.
Statistic measured | Number
No. of atherosclerotic CVD outcomes that developed in 10-year follow-up according to definition in current study | 28,561
No. of CVD outcomes that developed in 10-year follow-up according to comparator study definition | 28,242
No. of CVD outcomes after 10-year follow-up that overlap in the current study and comparator study definition | 456,184 out of 464,547 (98%)
No. of CVD outcomes identified in the current study but not in comparator studies | 4,341
No. of CVD outcomes included in comparator studies, but not in current study | 4,022
https://doi.org/10.1371/journal.pone.0263940.t001

In order to better evaluate the clinical implications and significance of our results, we compared the results of our benchmarked models with our baseline models Framingham and QRisk3. Table 2 shows that both our XGB and logistic regression classifiers achieved superior performance compared to the baseline models. Apart from the Random Forest model, all tested models had a higher AUROC than both baseline Framingham (68.0% and 68.1%) and QRisk3 (72.5%) models. The difference in AUROC performance of the Framingham score in our experiments in Fig 2 compared to Alaa et al. [29] is explainable by their use of an older UK Biobank version with 40,000 fewer baseline patients and a last available date of participant follow-up of February 17, 2016. The UK Biobank version we used includes biochemistry data released on May 1, 2019, including cholesterol, as well as additional questionnaire data. Additionally, more diagnosis data was made available over time. These dataset differences may help explain the difference in AUROC. Figs 3 and 4 show the AUROCs of all baseline models on imputed and unimputed data respectively.
Both Framingham versions perform nearly identically on imputed and unimputed data, whereas QRisk3 performs worse on unimputed data.
We also assessed the performance for fewer features. To reach the same performance as QRisk3 (72.5% AUROC), 16 features were necessary. The two most informative features were age and biological sex. To reach a similar performance to Framingham (68.0%), just two features were necessary (68.98%). It is worth noting, however, that both Framingham and QRisk3 were trained and tuned on other datasets and have different CVD definitions and objectives.

Generalizability of results
We assessed the generalizability of our models by re-training the two previously best performing models only on a White cohort and then testing them on a non-White cohort. Table 4 and […] Table 5 shows the relative regression feature weights of the 25 most informative risk factors in descending order. A full list is provided in the supplementary materials (S5 Table). Based on our previous manual curation of risk factors and outcomes, we can see that the most informative risk factors are distributed across 5 categories (Table 6), with the lifestyle category contributing the most risk factors. The two most informative features were age and biological sex. We provide a sensitivity analysis using SHAP values of the best performing logistic regression model for all risk factors in the supplementary materials (S1 Fig).

Discussion
Using data gathered from the large longitudinal UK Biobank cohort study, we developed a pipeline to benchmark several classification models for predicting a subject's 10-year absolute risk of developing an atherosclerotic CVD. We used an extensive, physician-curated set of risk factors and outcomes, employing a holistic view of the subject's current health status rooted in a precision medicine approach. The models were trained and evaluated using data from 464,547 UK Biobank participants, spanning 203 CVD risk factors for each subject. Using a simple logistic regression model with a holistic set of risk factors significantly improved the accuracy of atherosclerotic CVD risk prediction compared to currently available, widely used and recommended models such as Framingham and QRisk3. Both of these existing models rely on a limited set of risk factors and outcomes and do not focus on modifiable lifestyle factors. Further, our best performing logistic regression model utilizes new CVD risk predictors showing high predictive power, namely: social visits, walking pace and overall health rating. The frequency of social visits could be indicative of someone's current mental health status, which has been shown to be a relevant CVD risk factor [58,59]. These and other non-laboratory risk factors could be collected by means of a questionnaire or passively deduced using data analytics from data sources such as GPS, calendar and sensors in, e.g., smartphones, smartwatches and fitness trackers [26,60].

PLOS ONE
Additionally, our best performing models, XGBoost and logistic regression, showed marginal differences when trained and tested on particular sub-populations, which is indicative of good generalizability to other ethnicities.
As there was little performance difference between the best performing models, we primarily discuss the simplest model, logistic regression with L1 regularization. This model has the inherent benefit of offering reasoning for its predictions through the learned coefficients for every risk factor, and of having feature selection performed by the L1 regularization. With L1 regularization, the coefficients of less important risk factors are shrunk and in many cases set to zero, which removes these features from the model entirely and reduces the number of risk factors needed for an accurate prediction. Using iterative feature elimination, we identified a subset of the 25 most relevant risk factors providing a similar performance compared to using all 203 risk factors. The 25 most relevant risk factors are distributed across five different categories, suggesting that different biological layers contribute to the risk of atherosclerotic CVD. This result indicates that it is insufficient to assess only one biological layer for accurate risk prediction, supporting our initial model development approach [61]. Our approach takes into account multiple biological layers by using multi-omics as well as clinical and lifestyle data with the aim of capturing all potential interactions or correlations detected between molecules in different biological layers [22]. Multi-omics data generated for the same set of samples can provide useful insights into the interaction of biological information at multiple layers and thus can help in understanding the mechanisms underlying the complex biological condition of interest.
In our model, the lifestyle category contributed the most risk factors, suggesting that accurate prediction relies upon continuous daily lifestyle data and not just periodic snapshots of clinical data. The causal relationships between the risk factors considered in our model and atherosclerotic CVDs have been demonstrated by other studies [11,19,21,25].
Innovative approaches are needed in order to tackle the increasing prevalence and mortality of CVD-related diseases [2], and the associated healthcare systems' financial burdens. This is particularly true in low and middle income countries where CVD prevalence has also been increasing and is expected to increase as a consequence of an aging and growing population [2]. Our atherosclerotic CVD prediction model has the potential to support healthcare systems by identifying more people at risk earlier and more accurately than currently available models and intervening with personalized behavior change programs. Currently available models, like Framingham and QRisk3, have limited predictive capability for atherosclerotic CVDs as they were not trained on all of them and do not provide actionable results.
There is potential for novel disruptive approaches to affordably improve CVD outcomes. Areas where this may have an impact are novel approaches to screening, lifestyle coaching and prevention [2]. Screening will become more accessible and widespread as more (near-) medical-grade sensors are integrated into smartphones and smartwatches, enabling continuous monitoring of relevant behavioral CVD risk factors, as well as biomarkers such as heart rate, blood pressure and blood glucose. By gathering a wider spectrum of relevant risk factors for cardiovascular disease automatically and continuously, an ongoing and personalized cardiovascular disease risk prediction could be enabled. Through linking personalized information on an individual's CVD risk with app-based programs for sustained behavioral modification, it may be possible to lower the incidence and mortality of CVDs [62]. Combined with a companion smartphone-based app, an AI or healthcare provider-generated personalized intervention program could be provided and targeted at those people who need it the most.
Any system that gathers personal health data and predicts an individual's atherosclerotic CVD risk handles sensitive health data (e.g. laboratory values) and must adhere to local regulations and best practices in data transfer, processing and storage to ensure data privacy and security.
Many studies have shown that digital health interventions are cost effective for managing CVD (for a review see [63]). One report found that a community-based prevention program could have a mean return on investment (ROI) on medical cost savings of $5.60 for every $1 spent within a 5 year timeframe by improving physical activity and nutrition and reducing tobacco usage [64]. A review of 11 in-home cardiac rehabilitation programs for the secondary prevention of CVD found that social support, goal setting, monitoring, credible instructions and literature resources are all effective behavior change techniques to reduce behavioral risk factors for CVD [65].
The improvement achieved by our models might be partially attributed to their being trained and assessed on the UK Biobank dataset, whereas the baseline Framingham model was derived from a different population. The population and many of the data sources used in the QRisk3 model are similar to ours, being the general UK population and its GP, hospital and mortality records. However, our risk model generation approach and QRisk3's approach were designed with different aims and objectives, and the modelling strategies differ. For these reasons, direct comparison between the models is limited. Notable differences include the more limited sets of risk factors in Framingham and QRisk3 and the more focused yet wider range of atherosclerotic CVD outcomes included in our approach.
The results from our generalizability sub-analysis indicate that our XGB and logistic regression models might generalize well to other ethnicities and do not overfit to our cohort. However, this needs to be further evaluated with more data from diverse ethnicities.
Our results show that our models have improved performance over the baseline models Framingham and QRisk3 (Table 2). We attribute this to the fact that selecting the appropriate disease modelling approach and classifiers and carefully tuning the model hyperparameters are crucial steps for realizing the potential benefits of ML. Our pipeline automates some of these steps, which makes the tuning and discovery of new disease risk models easily accessible for clinical research. To our knowledge, our prospective cohort modelling approach, which is rooted in precision medicine, is the first to generate an atherosclerotic CVD absolute risk prediction tool based upon a complete definition of atherosclerotic CVD outcomes and a holistic set of risk factors.
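As an illustration of the metric behind these comparisons, the following minimal sketch computes an AUROC from predicted risks and observed outcomes and attaches a percentile bootstrap confidence interval. It is not the code used in this study: the function names `auroc` and `bootstrap_ci` are hypothetical, and the percentile bootstrap is an assumption, since the method used to derive the reported 95% CIs is not described in this section.

```python
import random


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one non-case")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile bootstrap CI (assumed method, see lead-in)."""
    point = auroc(labels, scores)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper
```

Here `labels` would hold the binary 10-year outcome (1 for participants who developed an atherosclerotic CVD) and `scores` the predicted risks of one model on the held-out participants. Running the same routine on each model's predictions gives comparable point estimates and intervals.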

Limitations
The UK Biobank only admitted participants aged 40 and older at initial signup. This might limit the applicability of the risk score to younger populations, and further tests with data from younger populations need to be conducted.
Many participants have missing values for potential risk factors. Having more unimputed data on relevant CVD risk factors could improve the predictive performance of all our benchmarked classifiers and could also change the classifier ranking in Table 2 and the relative risk factor importances in Table 5. However, the use of imputed data is highly unlikely to affect our conclusion that a holistic set of risk factors and an exhaustive atherosclerotic CVD outcome definition could improve actionable atherosclerotic CVD risk prediction.
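To make the role of imputation concrete, the sketch below shows one simple strategy: filling gaps with the training-set median and adding a flag that records which values were imputed. This is an assumption for illustration only, since the imputation method used in this study is not described in this section. The function names and the column names `sbp` and `ldl` are hypothetical.

```python
from statistics import median


def fit_medians(rows, columns):
    """Learn per-column medians from the observed (non-None) training values."""
    medians = {}
    for col in columns:
        observed = [row[col] for row in rows if row.get(col) is not None]
        if not observed:
            raise ValueError(f"column {col!r} has no observed values")
        medians[col] = median(observed)
    return medians


def impute(rows, medians):
    """Return copies of rows with gaps filled and a '<col>_missing' flag added."""
    out = []
    for row in rows:
        filled = dict(row)
        for col, med in medians.items():
            was_missing = row.get(col) is None
            filled[col] = med if was_missing else row[col]
            # The flag lets a classifier distinguish measured from imputed values.
            filled[f"{col}_missing"] = int(was_missing)
        out.append(filled)
    return out


# Hypothetical toy records; None marks a missing measurement.
train = [
    {"sbp": 120, "ldl": 3.0},
    {"sbp": None, "ldl": 4.0},
    {"sbp": 140, "ldl": None},
]
medians = fit_medians(train, ["sbp", "ldl"])
train_imputed = impute(train, medians)
```

The medians are learned from the training rows only and then reused for held-out rows, so that no information from the evaluation data leaks into preprocessing.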
An additional limitation of our study is that the UK Biobank dataset consists of participants of predominantly (88%) British ethnicity, with an even larger portion having a White background (91%). Therefore, further assessments of the influence of the ethnicity predictor need to be carried out to enable a generalizable tool. Previous work in this area indicates that the development of plaques seems to be independent of ethnicity [21].
A further limitation of this UK-focused dataset is that socio-economic and other environmental factors differ between countries. This is another potential bias that needs to be further evaluated with datasets from other countries with different socio-economic characteristics.
Disease risk prediction models that include subjective non-laboratory risk factors, such as the self-reported health rating and usual walking pace, should be evaluated cautiously to minimize self-report bias. Nevertheless, these risk factors have been found to be good predictors of overall CVD risk in another study using UK Biobank data [29].

Conclusions
We benchmarked multiple classifiers to predict an individual's 10-year risk of developing an atherosclerotic CVD, using a holistic set of risk factors and a specific definition of atherosclerotic CVDs. Our reduced logistic regression classifier with L1 regularization, a simple and interpretable model, is amongst our best prediction models, includes actionable lifestyle factors, has strong predictive power and requires 13 unique features. Our experiments showed that a two-feature questionnaire is as accurate as the Framingham models and a 16-feature questionnaire is as accurate as QRisk3 for 10-year atherosclerotic CVD risk prediction. Both prediction models, XGBoost and logistic regression, generalize well to non-White people, which might indicate that our models generalize well to other (western) countries. Framingham and QRisk3, which are well-established and validated absolute risk prediction models, do not perform as well at predicting individuals' 10-year risk of developing an atherosclerotic CVD. With our logistic regression model, we created a promising new interpretable, actionable and accurate risk prediction tool that could assist individuals and public health in CVD risk reduction.

Table. List of all outcomes used in our analysis. The following outcomes were all consolidated into one final binary outcome column indicating whether the respective UK Biobank participant did or did not develop one of the relevant atherosclerotic CVDs during their individual 10-year follow-up period, starting from their individual initial assessment attendance date. (XLSX) S3